Vision-language models (VLMs) play a crucial role in multimodal tasks such as image retrieval, image captioning, and medical diagnosis. These models aim to align visual and linguistic representations for more effective information processing. However, despite this progress, current VLMs still struggle to understand negation, which is critical in many applications, for example distinguishing "a room without windows" from "a room with windows."